
replace tl.libdevice.llrint with tl.extra.cuda.libdevice.rint #1372

Closed
bjmsong wants to merge 2 commits into bitsandbytes-foundation:main from bjmsong:develop

Conversation

@bjmsong

@bjmsong bjmsong commented Sep 26, 2024

No description provided.

Collaborator

@TimDettmers left a comment


PR Review: #1372 — Replace tl.libdevice.llrint with tl.extra.cuda.libdevice.rint

[bug-fix] Triton API migration: updates tl.libdevice.llrint calls in 3 Triton quantization kernels to use the new tl.extra.cuda.libdevice path. The tl.libdevice namespace was removed in Triton 3.x, so this addresses a real compatibility issue.

Blocking issues (2):

  1. Wrong function: rint instead of llrint — The PR replaces tl.libdevice.llrint with tl.extra.cuda.libdevice.rint, but these are semantically different functions. llrint rounds to the nearest integer and returns an integer type (long long). rint rounds to the nearest integer but returns a float. Since the result is being stored to an int8 output tensor, the practical difference may be minor (implicit float-to-int cast), but tl.extra.cuda.libdevice.llrint exists and is the correct 1:1 replacement. Using rint is an unnecessary semantic change that could introduce subtle numerical differences in edge cases (e.g., values exactly at 0.5 boundaries, NaN/Inf handling).
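The rounding-mode and return-type distinction can be illustrated with a small sketch. This is an assumption-laden analogy: NumPy's np.rint follows the same IEEE round-half-to-even behavior as the libdevice rint, and the .astype(np.int8) cast stands in for the implicit float-to-int conversion the kernel relies on.

```python
# Sketch (assumption: np.rint mirrors libdevice rint's
# round-half-to-even semantics; llrint rounds the same way by
# default but returns an integer type instead of a float).
import numpy as np

vals = np.array([0.5, 1.5, 2.5, -0.5, 126.7])

rounded = np.rint(vals)   # float output; ties round to even
# values: 0., 2., 2., -0., 127.  (note 0.5 -> 0., 1.5 -> 2.)
print(rounded, rounded.dtype)

# The implicit float-to-int8 cast the quantization kernel depends on:
as_int8 = rounded.astype(np.int8)
print(as_int8)  # values: 0, 2, 2, 0, 127
```

With finite inputs in range, rint followed by an int8 cast lands on the same values llrint would produce; the divergence risk is in out-of-range, NaN, and Inf inputs, where the float intermediate's cast behavior is implementation-defined.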

  2. No regression test — This is a bug fix that changes quantization kernel behavior. There should be a test that exercises the Triton quantization path and verifies correctness. The bitsandbytes/triton/ kernels are guarded behind is_triton_available(), so existing tests may not exercise them. A minimal test confirming quantize_rowwise, quantize_global, and quantize_columnwise_and_transpose produce correct output with the updated API would be valuable.
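As a starting point for such a test, here is a hedged pure-NumPy sketch of the rowwise absmax int8 quantization math. quantize_rowwise_ref is a hypothetical reference implementation, not the library's API; a real regression test would run the Triton quantize_rowwise kernel on a CUDA device and compare its output against a reference like this.

```python
# Hypothetical NumPy reference for a rowwise-quantization regression
# test. Assumes the kernel computes round(127 * x / rowmax(|x|)) per
# row and stores the result as int8.
import numpy as np

def quantize_rowwise_ref(x: np.ndarray):
    """Rowwise absmax int8 quantization reference."""
    absmax = np.abs(x).max(axis=1, keepdims=True)
    out = np.rint(127.0 * (x / absmax)).astype(np.int8)
    return out, absmax.squeeze(1)

x = np.array([[0.5, -1.0], [2.0, 4.0]], dtype=np.float32)
q, absmax = quantize_rowwise_ref(x)
print(q)        # [[64, -127], [64, 127]]
print(absmax)   # [1., 4.]
```

A GPU-gated pytest (skipped when is_triton_available() is False) could assert that the kernel output matches this reference exactly for well-conditioned inputs.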

Additional note: This PR is from September 2024 and is significantly behind main. A rebase will likely be needed. The Triton directory has had substantial changes since then (XPU triton optimizers, compatibility guards, etc.).

  • Security: Clear (trivial API path change, no new imports, no suspicious patterns)
  • Downstream impact: None (internal Triton kernels, not part of public API)
  • Tests: Missing — no test covers the Triton quantization path change
  • CI: Not triggered (fork PR — maintainer must approve workflow run)
  • Serialization: Not affected
  • Cross-PR conflicts: None detected

  abs_x = tl.abs(x)
  max_val = tl.max(tl.where(row_mask, abs_x, 0), axis=0)
- output = tl.libdevice.llrint(127.0 * (x / max_val))
+ output = tl.extra.cuda.libdevice.rint(127.0 * (x / max_val))
Collaborator


tl.extra.cuda.libdevice.rint is not the correct 1:1 replacement for tl.libdevice.llrint. llrint rounds to nearest integer and returns an integer type; rint rounds to nearest integer but returns a float. Since tl.extra.cuda.libdevice.llrint exists in modern Triton, this should use tl.extra.cuda.libdevice.llrint instead to preserve the original semantics.

  x = tl.load(x_ptr + offsets, mask=mask)
  absmax_inv = tl.load(absmax_inv_ptr)
- output = tl.libdevice.llrint(127.0 * (x * absmax_inv))
+ output = tl.extra.cuda.libdevice.rint(127.0 * (x * absmax_inv))
Collaborator


Same issue here: should be tl.extra.cuda.libdevice.llrint (not rint) to match the original semantics.

@matthewdouglas
Member

Closing in favor of #1871 which removes this functionality instead.

